

VECTORIZATION 


Techniques used

• Simple vectorization
• Strip mining
• Loop distribution
• Loop interchange


THE FORTRAN COMPILER


MAJOR FEATURES 


• Conforms to ANSI Fortran-77 Standard
  (ANSI X3.9-1978)

• Optional support of Fortran-66 features
  (ANSI X3.9-1966)

• VAX/VMS compatibility

• Interfaces to the symbolic debugger, CSD

• Interfaces to the performance analyzer

• Integrated into the CONVEX/UNIX environment


CONVEX EXTENSIONS 


• Hollerith constants can be used where
  a character value is expected

• Symbolic names may be of arbitrary length

• Variety of data types
    • Logical *1, *2, *4, *8
    • Integer *1, *2, *4, *8
    • Real *4, *8
    • Complex *8, *16
    • Character *len


Differences C-1/VAX Fortran

Does not support certain VAX extensions

• INCLUDE statement
• NAMELIST statement
• REAL*16 data type
• %DESCR

• Byte ordering w.r.t. characters and
  parameter passing


e Low order bit test for Is 


• Numerical differences, due to floating-point
  representation and rounding method

• 'cc...c' form of octal constants is typeless


Differences C-1/VAX Fortran

Does not support certain VAX I/O extensions

• DELETE, UNLOCK statements
• NAMELIST directed I/O
• Variable format expressions
• Indexed I/O (key-indexed files)
• File sharing
• DEFINEFILE statement
• Certain OPEN keywords
• Certain CLOSE keywords
• Record specifier 'r
• ASCII null as carriage control
e RECL keyword on OPENs is in — 
• Certain internal record formats differ
• SEGMENTED record type
• No RMS related extensions


Differences C-1/VAX Fortran 


Miscellaneous 


• User takes advantage of implementation
    • mechanism to pass arguments

• User takes advantage of VAX architecture
    • representation of character strings

• No VMS OS specific extensions
    • pathnames
    • calling system services

• No PDP-11 Fortran compatibility


SCALAR OPTIMIZATION


2 Types of Optimization are Provided

• Local
• Global


SCALAR OPTIMIZATION 


• Assignment substitution
• Redundant assignment elimination
• Redundant use elimination
• Redundant subexpression elimination
• Tree height reduction
• Constant propagation and folding
• Dead code elimination
• Code motion
• Strength reduction
• Instruction scheduling
• Branch optimization
• Register allocation


• Assignment Substitution

Code:


Becomes:


• Assignment Substitution

Unoptimized:

        load    @12(ap),s0
        add     #0x41400000,s0
        stor    s0,@8(ap)
        load    @8(ap),s1
        load    @4(ap),s2
        add     s2,s1
        stor    s1,@0(ap)

Optimized:

        load    @12(ap),s0
        load    @4(ap),s1
        add     #0x41400000,s0
        add     s0,s1
        stor    s0,@8(ap)
        stor    s1,@0(ap)


• Assignment Substitution

Source Code:

      subroutine assgn (d,e,x,z)
      real d,e,x,z
      x = z + 3
      d = x + e
      return
      end

Unoptimized:

L2:                             ; Stmt _L1
        ld.w    @12(ap),s0
        add.s   #0x41400000,s0
        st.w    s0,@8(ap)
        .stabd  0x44,0,4
L3:                             ; Stmt _L2
        ld.w    @8(ap),s1
        ld.w    @4(ap),s2
        add.s   s2,s1
        st.w    s1,@0(ap)
        .stabd  0x44,0,5
L4:                             ; Stmt _L3
        rtn                     ; #5
• Assignment Substitution

Source Code:

      subroutine assgn (d,e,x,z)
      real d,e,x,z
      x = z + 3
      d = x + e
      return
      end

Optimized:

_assgn_:
        ld.w    @12(ap),s0
        ld.w    @4(ap),s1
        add.s   #0x41400000,s0
        add.s   s0,s1
        st.w    s0,@8(ap)
        st.w    s1,@0(ap)
        rtn                     ; #5


• Redundant Assignment Elimination

Code:

      x = a + b
      x = e + f*g

Becomes:

      x = e + f*g


• Redundant Assignment Elimination

Source Code:

      subroutine redasg(x,a,b,e,f,g)
      real x,a,b,e,f,g
      x = a+b
      x = e+f*g
      return
      end

Unoptimized:

L2:                             ; Stmt _L1
        ld.w    @4(ap),s0       ; #3, A
        ld.w    @8(ap),s1       ; #3, B
        add.s   s1,s0           ; #3
        st.w    s0,@0(ap)       ; #3, X
        .stabd  0x44,0,4
L3:                             ; Stmt _L2
        ld.w    @16(ap),s2      ; #4, F
        ld.w    @20(ap),s3      ; #4, G
        ld.w    @12(ap),s4      ; #4, E
        mul.s   s3,s2           ; #4
        add.s   s2,s4           ; #4
        st.w    s4,@0(ap)       ; #4, X
        .stabd  0x44,0,5
L4:                             ; Stmt _L3
        rtn                     ; #5


• Redundant Assignment Elimination

Source Code:

      subroutine redasg(x,a,b,e,f,g)
      real x,a,b,e,f,g
      x = a+b
      x = e+f*g
      return
      end

Optimized:

_redasg_:
        ld.w    @16(ap),s0      ; #4, F
        ld.w    @20(ap),s1      ; #4, G
        ld.w    @12(ap),s2      ; #4, E
        mul.s   s1,s0           ; #4
        add.s   s0,s2           ; #4
        st.w    s2,@0(ap)       ; #4, X
        rtn                     ; #5


• Redundant Assignment Elimination


Source: 
program test 
x= yz 
a = 0
if(a.gt.0) then 
am=x"*y+, 
else 
x =a-b*c 
endif 
end 


Becomes: 


program test 
end 


; INSTRUCTIONS

        .text
        ds.w    0x1040b07
        ds.b    "-O2\0"
        .globl  _MAIN_
_MAIN_:
        rtn                     ; #9
• Redundant Use Elimination

Source Code:

      subroutine reduse(x,a,b,c,d)
      real x,a,b,c,d
      a = x*b
      c = x+d
      return
      end

Unoptimized:

L2:                             ; Stmt _L1
        ld.w    @0(ap),s0       ; #3, X
        ld.w    @8(ap),s1       ; #3, B
        mul.s   s1,s0           ; #3
        st.w    s0,@4(ap)       ; #3, A
        .stabd  0x44,0,4
L3:                             ; Stmt _L2
        ld.w    @0(ap),s2       ; #4, X
        ld.w    @16(ap),s3      ; #4, D
        add.s   s3,s2           ; #4
        st.w    s2,@12(ap)      ; #4, C
        .stabd  0x44,0,5
L4:                             ; Stmt _L3
        rtn                     ; #5


• Redundant Use Elimination

Source Code:

      subroutine reduse(x,a,b,c,d)
      real x,a,b,c,d
      a = x*b
      c = x+d
      return
      end

Optimized:

_reduse_:
        ld.w    @0(ap),s0       ; #3, X
        ld.w    @16(ap),s1      ; #4, D
        ld.w    @8(ap),s2       ; #3, B
        add.s   s0,s1           ; #4
        mul.s   s2,s0           ; #3
        st.w    s1,@12(ap)      ; #4, C
        st.w    s0,@4(ap)       ; #3, A
        rtn                     ; #5


• Redundant Subexpression Elimination

Code:

      a = x + y
      b = c*(x + y)

Becomes:

      t = x + y
      a = t
      b = c*t


• Redundant Subexpression Elimination

Source Code:

      subroutine redsub(x,y,a,b,c)
      real x,y,a,b,c,d
      a = x+y
      b = c*(x+y)
      return
      end

Unoptimized:

L2:                             ; Stmt _L1
        ld.w    @0(ap),s0       ; #3, X
        ld.w    @4(ap),s1       ; #3, Y
        add.s   s1,s0           ; #3
        st.w    s0,@8(ap)       ; #3, A
        .stabd  0x44,0,4
L3:                             ; Stmt _L2
        ld.w    @0(ap),s2       ; #4, X
        ld.w    @4(ap),s3       ; #4, Y
        ld.w    @16(ap),s4      ; #4, C
        add.s   s3,s2           ; #4
        mul.s   s2,s4           ; #4
        st.w    s4,@12(ap)      ; #4, B
        .stabd  0x44,0,5
L4:                             ; Stmt _L3
        rtn                     ; #5


• Redundant Subexpression Elimination

Source Code:

      subroutine redsub(x,y,a,b,c)
      real x,y,a,b,c,d
      a = x+y
      b = c*(x+y)
      return
      end

Optimized:

_redsub_:
        ld.w    @0(ap),s0       ; #3, X
        ld.w    @4(ap),s1       ; #3, Y
        ld.w    @16(ap),s2      ; #4, C
        add.s   s1,s0           ; #3
        mul.s   s0,s2           ; #4
        st.w    s0,@8(ap)       ; #3, A
        st.w    s2,@12(ap)      ; #4, B
        rtn                     ; #5


• Constant Propagation and Folding

Code:

      i = 5
      j = 0
        (other work occurs, but
         value of j is not changed)
      j = j + 2
      n = k + i * j

Becomes:

      i = 5
      j = 0
        (other work occurs, but
         value of j is not changed)
      j = 2
      n = k + 10

• Constant Propagation and Folding

Code:

      a = 5
      b = 15
      if(i.le.0) then
         a = 6
         c = a
      else
         c = a + b
      endif

Becomes:

      a = 5
      b = 15
      if(i.le.0) then
         a = 6
         c = 6
      else
         c = 20
      endif


• Constant Propagation and Folding

Source Code:

      subroutine constp1(n,i,j,k)
      integer i,j,k
      i = 5
      j = 0
      j = j+2
      n = k+i*j
      return
      end

Unoptimized:

L2:                             ; Stmt _L1
        ld.w    #0x0000005,s0   ; #3
        st.w    s0,@4(ap)       ; #3, I
        .stabd  0x44,0,4
L3:                             ; Stmt _L2
        ld.w    #0x0000000,s1   ; #4
        st.w    s1,@8(ap)       ; #4, J
        .stabd  0x44,0,5
L4:                             ; Stmt _L3
        ld.w    @8(ap),s2       ; #5, J
        add.w   #0x0000002,s2   ; #5
        st.w    s2,@8(ap)       ; #5, J
        .stabd  0x44,0,6
L5:                             ; Stmt _L4
        ld.w    @4(ap),s3       ; #6, I
        ld.w    @8(ap),s4       ; #6, J
        ld.w    @12(ap),s5      ; #6, K
        mul.w   s4,s3           ; #6
        add.w   s3,s5           ; #6
        st.w    s5,@0(ap)       ; #6, N
        .stabd  0x44,0,7
L6:                             ; Stmt _L5
        rtn                     ; #7


• Constant Propagation and Folding

Source Code:

      subroutine constp1(n,i,j,k)
      integer i,j,k
      i = 5
      j = 0
      j = j+2
      n = k+i*j
      return
      end

Optimized:

_constp1_:
        ld.w    #0x0000002,s1   ; #5
        ld.w    #0x0000005,s0   ; #3
        ld.w    @12(ap),s2      ; #6, K
        add.w   #0x000000a,s2   ; #6
        st.w    s1,@8(ap)       ; #5, J
        st.w    s0,@4(ap)       ; #3, I
        st.w    s2,@0(ap)       ; #6, N
        rtn                     ; #7
• Tree Height Reduction

Code:

      a + b + c + d + e + f + g + h

Usual:

      (a + (b + (c + (d + (e + (f + (g + h)))))))

Becomes:

      (((a + b) + (c + d)) + ((e + f) + (g + h)))
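
The regrouping on the slide above can be sketched in C. This is only an illustration, not the compiler's own method: the function names `sum_chain` and `sum_tree8` are hypothetical, and the chain is written left to right rather than right-nested. The chained form has seven dependent additions; the balanced form has the same seven additions but a dependent depth of three, so independent pairs can be issued together.

```c
#include <stddef.h>

/* Chained sum: every addition waits on the previous one (depth n). */
double sum_chain(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Balanced sum of exactly eight operands, grouped as on the slide:
 * ((a+b)+(c+d)) + ((e+f)+(g+h)).  The four pair sums are independent. */
double sum_tree8(const double v[8])
{
    double ab = v[0] + v[1];
    double cd = v[2] + v[3];
    double ef = v[4] + v[5];
    double gh = v[6] + v[7];
    return (ab + cd) + (ef + gh);
}
```

Because floating-point addition is not associative, the two groupings can round differently for general inputs; they agree exactly when every intermediate sum is representable, as with small integers.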


• Tree Height Reduction

Source Code:

      subroutine treehgt(x,a,b,c,d,e,f,g,h)
      real x,a,b,c,d,e,f,g,h
      x = a+b+c+d+e+f+g+h
      return
      end

Optimized:

_treehgt_:
        ld.w    @4(ap),s0       ; #3, A
        ld.w    @8(ap),s1       ; #3, B
        ld.w    @12(ap),s2      ; #3, C
        ld.w    @16(ap),s3      ; #3, D
        ld.w    @20(ap),s4      ; #3, E
        ld.w    @24(ap),s5      ; #3, F
        ld.w    @28(ap),s6      ; #3, G
        ld.w    @32(ap),s7      ; #3, H
        add.s   s1,s0           ; #3
        add.s   s3,s2           ; #3
        add.s   s5,s4           ; #3
        add.s   s7,s6           ; #3
        add.s   s2,s0           ; #3
        add.s   s6,s4           ; #3
        add.s   s4,s0           ; #3
        st.w    s0,@0(ap)       ; #3, X
        rtn                     ; #4


• Dead Code Elimination

Code:

      logical t
      t = .true.
      if(t) then
         print *,a,b
      else
         x = x*4.0
      endif
      end

Becomes:

      logical t
      print *,a,b
      end


• Dead Code Elimination

      program ddcde
      real a,b
      logical t
      a = 4.3
      b = 5.2
      t = .true.
      if(t) then
         print *,a,b
      else
         x = x*4.0
      endif
      end

; INSTRUCTIONS

        .text
        ds.w    0x1060000
        ds.b    "-O2\0"
        .globl  _MAIN_
_MAIN_:
        ldea    LC+24,ap        ; #8, ?LC
        calls   _for$s_wsle     ; #8, for$s_wsle
        ldea    LC+44,ap        ; #8, ?LC
        calls   _for$do_lio     ; #8, for$do_lio
        ldea    LC+76,ap        ; #8, ?LC
        calls   _for$do_lio     ; #8, for$do_lio
        ldea    LC+96,ap        ; #8, ?LC
        calls   _for$e_wsle     ; #8, for$e_wsle
        rtn                     ; #0


• Code Motion

Code:

      do i=1,100
         a = b + (c*4)/(d + c)
         ar(i) = a + d**2
      enddo

Becomes:

      a = b + (c*4)/(d + c)
      t1 = a + d**2
      do i=1,100
         ar(i) = t1
      enddo
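
As a rough C counterpart to the transformation above, the sketch below hoists the loop-invariant expressions out of the loop by hand. The names `fill_naive` and `fill_hoisted` are hypothetical, and the sketch assumes `a` does not alias any element of `ar`; a compiler has to prove the same thing before it may move the code.

```c
#include <stddef.h>

/* Straightforward form: the invariant expressions are recomputed every trip. */
void fill_naive(float *a, float b, float c, float d, float *ar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        *a = b + (c * 4.0f) / (d + c);
        ar[i] = *a + d * d;
    }
}

/* Hoisted form: both invariant expressions are evaluated once before the loop. */
void fill_hoisted(float *a, float b, float c, float d, float *ar, size_t n)
{
    if (n == 0)
        return;               /* zero-trip loop: leave *a untouched, as the naive form does */
    *a = b + (c * 4.0f) / (d + c);
    float t1 = *a + d * d;    /* temporary named after t1 on the slide */
    for (size_t i = 0; i < n; i++)
        ar[i] = t1;
}
```

The zero-trip guard matters only in the general case; the slide's loop always runs 100 times, so the assignment to `a` can be moved unconditionally there.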


• Code Motion

Source Code:

      subroutine motion(a,b,c,d,ar)
      dimension ar(100)
      do i=1,100
         a = b + (c*4)/(d + c)
         ar(i) = a + d**2
      enddo
      return
      end


• Code Motion

Unoptimized:

                                ; Stmt _L1
        ld.w    #0x0000001,s0
        st.w    s0,LU
        .stabd  0x44,0,4
L3:                             ; Stmt -1
        ld.w    @8(ap),s1
        ld.w    @12(ap),s3
        ld.w    @4(ap),s4
        add.s   s1,s3
        mov.w   s1,s2
        mul.s   #0x41800000,s2
        div.s   s3,s2
        add.s   s2,s4
        st.w    s4,@0(ap)
        .stabd  0x44,0,5
L4:                             ; Stmt _L2
        ld.w    @12(ap),s5
        ld.w    @0(ap),s6
        ld.w    LU,a1
        ld.w    16(ap),a2
        mul.s   s5,s5
        shf     #0x0000002,a1
        add.w   a2,a1
        add.s   s5,s6
        st.w    s6,-4(a1)
L5:                             ; Stmt _L3
        ld.w    LU,s7
        add.w   #0x0000001,s7
        st.w    s7,LU
L6:                             ; Stmt _L4
        ld.w    LU,s0
        lt.w    #0x0000064,s0
        jbrs.f  L3
        .stabd  0x44,0,7
L7:                             ; Stmt _L5
        rtn                     ; #7


• Code Motion

Optimized:

_motion_:
        sub.w   #0x0000020,sp
        ld.w    #0x0000004,s5
        ld.w    #0x0000190,s6
        ld.w    @8(ap),s0
        ld.w    @4(ap),s3
        ld.w    @12(ap),s2
        ld.w    16(ap),a1
        sub.w   s6,s5
        mov.w   s0,s1
        mul.s   #0x41800000,s1
        add.s   s2,s0
        mov.w   s2,s4
        mul.s   s2,s4
        mov     a1,s7
        sub.w   s5,s7
        div.s   s0,s1
        add.s   s1,s3
        add.s   s3,s4
        st.w    a1,-4(fp)
        st.w    s7,-8(fp)
        st.w    s3,@0(ap)
        st.w    s3,-24(fp)
        st.w    s4,-32(fp)
L3:                             ; Stmt -1
        ld.w    -4(fp),a2
        ld.w    -8(fp),a4
        ld.w    -32(fp),s0
        mov     a2,a3
        add.w   #0x0000004,a2
        le.w
        st.w
        jbra.t
        st.w
        rtn


• Strength Reduction

Code:

      i = 1
   10 j = i*k
      i = i+2
      if(i.le.100) go to 10

Becomes:

      t1 = k
      t2 = 100*k
      t3 = 2*k
   10 j = t1
      t1 = t1+t3
      if(t1.le.t2) go to 10
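
The rewrite above replaces the multiply inside the loop with an add. A minimal C sketch of both forms follows; the function names are hypothetical, and each returns the last value assigned to `j` so the two can be compared. Note the assumption that `k > 0`: the rewritten exit test `t1 <= t2` matches `i <= 100` only in that case.

```c
/* Original loop: one multiply per trip, i = 1, 3, ..., 99. */
int last_j_multiply(int k)
{
    int i = 1, j;
    do {
        j = i * k;
        i = i + 2;
    } while (i <= 100);
    return j;
}

/* Strength-reduced loop: t1 tracks i*k and is advanced by t3 = 2*k.
 * Assumes k > 0 so that comparing t1 with t2 = 100*k is equivalent
 * to comparing i with 100. */
int last_j_add(int k)
{
    int t1 = k;        /* i*k for i = 1 */
    int t2 = 100 * k;  /* loop bound scaled by k */
    int t3 = 2 * k;    /* step of i scaled by k */
    int j;
    do {
        j = t1;
        t1 = t1 + t3;
    } while (t1 <= t2);
    return j;
}
```

For example, with `k = 3` both functions return 297, the product for the final trip `i = 99`.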


• Strength Reduction

Source Code:

      subroutine gstren(j,k)
      i = 1
   10 j = i*k
      i = i+2
      if(i.le.100) go to 10
      return
      end


• Strength Reduction

; INSTRUCTIONS

        .text
        ds.w    0x1040b07
        ds.b    "-O2\0"
        .globl  _gstren_
_gstren_:
        sub.w   #0x0000010,sp
        ld.w    @4(ap),s0
        mov.w   s0,s1
        mul.w   #0x0000064,s1
        st.w    s0,-4(fp)
        st.w    s1,-8(fp)
        shf     #0x0000001,s0
        st.w    s0,-12(fp)
                                ; Stmt 10
        ld.w    -8(fp),s4

        ld.w    -12(fp),s3
        ld.w    -4(fp),s2
        add.w   s2,s3
        le.w    s3,s4
        st.w    s2,@0(ap)
        st.w    s3,-4(fp)
        jbrs.t  M3
        rtn                     ; #6


• Strength Reduction

Source Code:

      subroutine gstren(j,k)
      i = 1
   10 j = i*k
      i = i+2
      if(i.le.100) go to 10
      return
      end


• Strength Reduction

Unoptimized:

L2:                             ; Stmt _L1
        ld.w    #0x0000001,s0
        st.w    s0,LU
        .stabd  0x44,0,3
L3:                             ; Stmt 10
        ld.w    LU,s1
        ld.w    @4(ap),s2
        mul.w   s2,s1
        st.w    s1,@0(ap)
        .stabd  0x44,0,4
L4:                             ; Stmt _L2
        ld.w    LU,s3
        add.w   #0x0000002,s3
        st.w    s3,LU
        .stabd  0x44,0,5
L5:                             ; Stmt _L3
        ld.w    LU,s4
        lt.w    #0x0000064,s4
        jbrs.f  L3
        .stabd  0x44,0,6
L6:                             ; Stmt _L4
        rtn                     ; #6


• Strength Reduction

Optimized:

_gstren_:
        sub.w   #0x0000008,sp
        ld.w    #0x0000001,s0
        ld.w    @4(ap),s1
        st.w    s0,LU
        st.w    s1,-8(fp)
        shf     #0x0000001,s1
        st.w    s1,-4(fp)
                                ; Stmt 10
        ld.w    LU,s2
        ld.w    -8(fp),s3
        ld.w    -4(fp),s4
L3:
        add.w   #0x0000002,s2
        st.w    s3,@0(ap)
        add.w   s4,s3
        lt.w    #0x0000064,s2
        jbrs.f  L3
        st.w
        st.w
        rtn


VECTORIZATION 


Advantages of Vector Processing

• Eliminates overhead associated with
  loop control

• Reduces loops to a simple sequence of
  instructions

Vector Stride

• Contiguous or Unity

         A(I) = B(I)
   10 CONTINUE

• Constant

      DO 10 I = 1, 6, 2
         A(I) = B(I)
   10 CONTINUE

• Random

      DO 10 I = 1, 100
         A(I) = B(INDEX(I))
   10 CONTINUE


VECTORIZATION CAPABILITIES

• Simple loops
• Strip mining
• Loop interchange
• Loop distribution
• Recognize reduction operators
• Generate vector of indices
• Perform scalar expansion
• Vectorize conditionals
• Partial vectorization
• Detect loop iteration count
• Accept all data types


• Simple Vectorization

Source:

      do 10 i = 1, 100
         a(i) = b(i) + c(i)
   10 continue


Scalar Instructions - Full Optimization 


Ms: 


Id.w 
mov 


add.w 
add.w 


add.w 
add.w 
add.d 
lt.w 
st. 
jbra.f 
st.w 
rtn 


# 0x0000008,sp 
# 0x0000000,s0 
sO,-4(fp) 
; Stmt -1 
-4(fp),a1 


4(ap),a3 
8(ap),a4 
O(ap),a5 

al,a2 

a2,a3 

0(a3),s1 

a2,a4 

O(a4),s2 

# 0x0000008,a1 
abd,a2 

s2,sl 

# 0x0000318,a1 
s1,0(a2) 

M3 — 

al,-4(fp) 


e 
3 


Vector Instructions

_simple_:
        ld.w    #0x0000064,vl
        ld.w    #0x0000008,vs
        ld.w    4(ap),a1
        ld.w    8(ap),a2
        ld.w    0(ap),a3
        ld.l    0(a1),v0
        ld.l    0(a2),v1
        add.d   v0,v1,v2
        st.l    v2,0(a3)
        rtn                     ; #7


• Simple Vectorization

Source:

      subroutine simp(a,b,c)
      real a(100),b(100),c(100)
      do 10 i = 1, 100
         a(i) = b(i) + c(i)
   10 continue
      return
      end

Compiled:

      subroutine simp(a,b,c)
      real a(100),b(100),c(100)
C###3 [fc] Loop on line 3 of t1.f (DO I) fully vectorized%%%
      do 10 i = 1, 100
         a(i) = b(i) + c(i)
   10 continue
      return
      end


• Simple Vectorization

Source:

      subroutine simp(a,b,c,n)
      real a(100),b(100),c(100)
      integer n
      do 10 i = 1, n
         a(i) = b(i) + c(i)
   10 continue
      return
      end

Compiled:

      subroutine simp(a,b,c,n)
      real a(100),b(100),c(100)
      integer n
C###4 [fc] Loop on line 4 of t1.f (DO I) fully vectorized%%%
      do 10 i = 1, n
         a(i) = b(i) + c(i)
   10 continue
      return
      end


• Simple Vectorization

Source:

      subroutine simp(a,b,c,n)
      real a(100),b(100),c(100)
      integer n
      do 10 i = 1, n, 2
         a(i) = b(i) + c(i)
   10 continue
      return
      end

Compiled:

      subroutine simp(a,b,c,n)
      real a(100),b(100),c(100)
      integer n
C###4 [fc] Loop on line 4 of t1.f (DO I) fully vectorized%%%
      do 10 i = 1, n, 2
         a(i) = b(i) + c(i)
   10 continue
      return
      end


• Strip Mining

Code:

      do 10 i = 1, n
         a(i) = b(i) * c(i)
   10 continue

Becomes:

      j = 0
      do 20 lv = n, 0, -128
         do 10 i = 1, min(128,lv)
            a(i+j) = b(i+j) * c(i+j)
   10    continue
         j = j+128
   20 continue
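
The same strip-mining idea can be written as a small C sketch. The strip length of 128 is taken from the slide; the function name `multiply_strips` and the 0-based indexing are assumptions made for illustration. The outer loop walks the array one strip at a time, and the inner loop, which never exceeds one strip, stands in for the work a single vector instruction would do.

```c
#include <stddef.h>

#define STRIP 128   /* strip length used on the slide */

/* a[i] = b[i] * c[i] for i in [0, n), processed in strips of at most STRIP. */
void multiply_strips(double *a, const double *b, const double *c, size_t n)
{
    for (size_t j = 0; j < n; j += STRIP) {
        /* Length of this strip: min(STRIP, elements remaining). */
        size_t len = (n - j < STRIP) ? n - j : STRIP;

        /* Inner loop over one strip. */
        for (size_t i = 0; i < len; i++)
            a[j + i] = b[j + i] * c[j + i];
    }
}
```

With `n = 300` the strips have lengths 128, 128, and 44, so the last strip is shorter than the others, just as `min(128,lv)` shortens the final trip in the Fortran form.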


• Loop Interchange

Code:

      do 20 i = 1, n
         do 10 j = 1, m
            a(i,j) = b(i,j) * c(i,j)
   10    continue
   20 continue

Becomes:

      do 20 j = 1, m
         do 10 i = 1, n
            a(i,j) = b(i,j) * c(i,j)
   10    continue
   20 continue
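
To show why the interchange helps, the C sketch below stores the arrays in column-major order through a hypothetical `IDX` macro, mimicking Fortran's layout for an n-by-m array (C's own two-dimensional arrays are row-major, so the macro is needed to make the comparison fair). With `j` innermost, successive iterations touch elements `n` apart; with `i` innermost they touch adjacent elements, which is the unit stride a vector load prefers.

```c
#include <stddef.h>

/* Column-major index for an n-by-m array, 0-based (Fortran-style layout). */
#define IDX(i, j, n) ((i) + (j) * (n))

/* Original nest: j is innermost, so the inner loop has stride n. */
void mul_j_inner(double *a, const double *b, const double *c, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            a[IDX(i, j, n)] = b[IDX(i, j, n)] * c[IDX(i, j, n)];
}

/* Interchanged nest: i is innermost, so the inner loop has stride 1. */
void mul_i_inner(double *a, const double *b, const double *c, size_t n, size_t m)
{
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < n; i++)
            a[IDX(i, j, n)] = b[IDX(i, j, n)] * c[IDX(i, j, n)];
}
```

The interchange is legal here because each element of `a` depends only on the matching elements of `b` and `c`, so the iterations can run in any order.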


• Loop Interchange

Source:

      subroutine t1 (a,b,c,n,m)
      real a(n,m), b(n,m), c(n,m)
      integer i, j, n, m
      do 20 i = 1, n
         do 10 j = 1, m
            a(i,j) = b(i,j) * c(i,j)
   10    continue
   20 continue
      return
      end

Compiled:

      subroutine t1 (a,b,c,n,m)
      real a(n,m), b(n,m), c(n,m)
      integer i, j, n, m
C###4 [fc] Loop on line 4 of t1.f (DO I) fully vectorized%%%
C###4 [fc] Loop on line 4 of t1.f (DO I) interchanged to
C     be innermost loop of nest%%%
      do 20 i = 1, n
         do 10 j = 1, m
            a(i,j) = b(i,j) * c(i,j)
   10    continue
   20 continue
      return
      end
• Loop Distribution

Code:

      do 20 i = 1, n
         b(i,1) = 0
         do 10 j = 1, m
            a(i) = a(i) + b(i,j) * c(i,j)
   10    continue
   20 continue

Becomes:

      do 20a i = 1, n
         b(i,1) = 0
  20a continue

      do 20b i = 1, n
         do 10 j = 1, m
            a(i) = a(i) + b(i,j) * c(i,j)
   10    continue
  20b continue


• Loop Distribution

Compiled:

      subroutine t1 (a,b,c,n,m)
      real a(n), b(n,m), c(n,m)
      integer i, j, n, m
C###4 [fc] Loop on line 4 of t1.f (DO I)
C     (distributed loop #2) fully vectorized%%%
C###4 [fc] Loop on line 4 of t1.f (DO I) interchanged to
C     be innermost loop of nest%%%
C###4 [fc] Loop on line 4 of t1.f (DO I)
C     (distributed loop #1) fully vectorized%%%
C###4 [fc] Loop on line 4 of t1.f (DO I) distributed,
C     forming 2 loops%%%
      do 20 i = 1, n
         b(i,1) = 0
         do 10 j = 1, m
            a(i) = a(i) + b(i,j) * c(i,j)
   10    continue
   20 continue
      return
      end


• Loop Distribution

Source:

      do 20 i = 1, n
         b(i,1) = 0
         do 10 j = 1, m
            a(i) = a(i) + b(i,j) * c(i,j)
   10    continue
         d(i) = e(i) + a(i)
   20 continue

Becomes:

      do 20a i = 1, n
         b(i,1) = 0
  20a continue

      do 20b i = 1, n
         do 10 j = 1, m
            a(i) = a(i) + b(i,j) * c(i,j)
   10    continue
  20b continue

      do 20c i = 1, n
         d(i) = e(i) + a(i)
  20c continue
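
A C sketch of the three-way split above is given below, under the assumptions that the arrays do not overlap and that two-dimensional arrays are stored column-major through a hypothetical `IDX` macro (to mirror the Fortran layout). The function names are illustrative only. Each statement group gets its own loop over `i`, in the original statement order, which preserves the only dependences present: the zeroing of `b(i,1)` precedes its use, and `a(i)` is complete before `d(i)` reads it.

```c
#include <stddef.h>

/* Column-major index for an n-by-m array, 0-based (Fortran-style layout). */
#define IDX(i, j, n) ((i) + (j) * (n))

/* Original single nest with three statement groups in the outer loop. */
void nest_original(double *a, double *b, const double *c, double *d,
                   const double *e, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++) {
        b[IDX(i, 0, n)] = 0.0;
        for (size_t j = 0; j < m; j++)
            a[i] = a[i] + b[IDX(i, j, n)] * c[IDX(i, j, n)];
        d[i] = e[i] + a[i];
    }
}

/* Distributed form: one loop per statement group, same order as before. */
void nest_distributed(double *a, double *b, const double *c, double *d,
                      const double *e, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++)          /* loop 20a */
        b[IDX(i, 0, n)] = 0.0;

    for (size_t i = 0; i < n; i++)          /* loop 20b */
        for (size_t j = 0; j < m; j++)
            a[i] = a[i] + b[IDX(i, j, n)] * c[IDX(i, j, n)];

    for (size_t i = 0; i < n; i++)          /* loop 20c */
        d[i] = e[i] + a[i];
}
```

After the split, the first and third loops are simple one-statement loops, and the middle nest can be handled separately, for instance by interchange.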


• Reduction Operators

Maximum:

      do 10 i = 1, 100
         t1 = max(t1,a(i))
   10 continue

Minimum:

      do 20 j = 1, 100
         t3 = min(t3,b(j))
   20 continue

Sum:

      do 30 j = 1, 100
         t2 = t2 + b(j)
   30 continue

Product:

      do 40 i = 1, 100
         t3 = t3 * c(i)
   40 continue


• Reduction Operators

Source:

      subroutine redc (a,b,c,t1,t2,t3)
      real*8 a(100), b(100), c(100), t1, t2, t3
      integer*4 i, j
C###4 [fc] Loop on line 4 of redc.f (DO I) fully vectorized%%%
      do 10 i = 1, 100
         t1 = max(t1,a(i))
   10 continue
C###7 [fc] Loop on line 7 of redc.f (DO J) fully vectorized%%%
      do 20 j = 1, 100
         t3 = min(t3,b(j))
   20 continue
C###10 [fc] Loop on line 10 of redc.f (DO J) fully vectorized%%%
      do 30 j = 1, 100
         t2 = t2 + b(j)
   30 continue
C###13 [fc] Loop on line 13 of redc.f (DO I) fully vectorized%%%
      do 40 i = 1, 100
         t3 = t3 * c(i)
   40 continue
      return
      end


_redc_: 


id-w 
Id.w 


; Id.w 


1d.1 
ld.] 
max.d 
st.l 


-. Reduction Operators 


# 0x0000064,v1 
# O0x0000008,vs 
O(ap),al 

@ 12(ap),sO 
O(al),vO 

vO 

s0,@ 12(ap) 

# 0x0000064,v1 
# 0x0000008,vs 
4(ap),a2 

@ 20(ap),sl 
O(a2),v1 

vl 

s1,@ 20(ap) 

# 0x0000064,v1 
# 0x0000008,vs 
4(ap),a3 

@ 16(ap),s2 
O(a3),v2 

v2 

s2,@ 16(ap) 

# 0x0000064,v1 
# 0x0000008,vs 
8(ap),a4 

@ 20(ap),s3 
0(a4),v3 


v3 


s3,@ 20(ap) 
; # 20 


;#10,B 
; #10, DMIN1 
; #10, T3 
; #14 

; #14 
;#14,B 
; #14, T2 
;#14,B 
; #14 

3; #14, T2 
; #18 

; #18 
;#18,C 
; #18, T3 
;#18,C 
; #18 

; #18, T3 


• Vector of Indices

Source:

      subroutine iotafn(x)
      real x(100)
      do i=1,100
         x(i) = i
      enddo
      return
      end

Assembler:

_iotafn_:
        ld.w    0(ap),a1        ; #4, X
        ld.w    #0x0000064,vl   ; #4
        ld.w    #0x0000004,vs   ; #4
        ld.w    _mth$r_indx,v0  ; #4, mth$r_indx
        st.w    v0,0(a1)        ; #4, X
        rtn                     ; #6


• Vector of Indices

Source:

      subroutine iotafn(x)
      real x(100)
      do i=1,100
         x(i) = i
      enddo
      return
      end

Assembler:

_iotafn_:
        load    x,R1            ; starting address of array x
        load    100,VL          ; vector length
        load    4,VS            ; size of array elements
        load    _mth$r_indx,V0  ; invoke mth$r_indx
        stor    V0,x            ; store values in x
        rtn

• Vector of Indices

Source:

      subroutine iotafn(x)
      real x(100)
      do i=1,100
         x(i) = i
      enddo
      return
      end

Compiled:

      subroutine iotafn(x)
      real x(100)
C###3 [fc] Loop on line 3 of iotafn.f (DO I) fully vectorized%%%
      do i=1,100
         x(i) = i
      enddo
      return
      end


• Vector of Indices - Scatter/Gather

Source:

      do 10 j = 1, 100
         ix(j) = j
   10 continue
      do 20 j = 1, 100
         a(j) = b(ix(j))
   20 continue


• Vector of Indices - Scatter/Gather

Source:

      subroutine gath (a,b,c,ix)
      real*8 a(100), b(100), c(100)
      integer*4 j, ix(100)
C###4 [fc] Loop on line 4 of gath.f (DO J) fully vectorized%%%
      do 10 j = 1, 100
         ix(j) = j
   10 continue
C###7 [fc] Loop on line 7 of gath.f (DO J) fully vectorized%%%
      do 20 j = 1, 100
         a(j) = b(ix(j))
   20 continue
      return
      end


• Vector of Indices - Scatter/Gather

_gath_:
        ld.w    12(ap),a1       ; #5, IX
        ld.w    #0x0000064,vl
        ld.w    #0x0000004,vs
        ld.w    _mth$j_indx,v0
        st.w    v0,0(a1)
        ld.w    #0x0000064,vl
        ld.w    #0x0000004,vs
        ld.w    12(ap),a2
        ld.w    #0x0000008,s0
        ld.w    4(ap),a5
        ld.w    0(ap),a3
        ld.w    0(a2),v1
        mul.w   v1,s0,v2
        ld.w    #0x0000008,vs
        add.w   #0xfffffff8,a5
        ldvi.l  v2,v3
        st.l    v3,0(a3)
        rtn                     ; #10


• Scalar Expansion

Code:

      do i=1,100
         if(z(i).gt.0.0) then
            t = y1(i)
         else
            t = y2(i)
         endif
         x(i) = t
      enddo

Becomes:

      do i=1,100
         if(z(i).gt.0.0) then
            temp(i) = y1(i)
         else
            temp(i) = y2(i)
         endif
         x(i) = temp(i)
      enddo


• Scalar Expansion

Source:

      subroutine sclrex(x,y1,y2,z)
      real x(100),y1(100),y2(100),z(100)
      do i=1,100
         if(z(i).gt.0.0) then
            t = y1(i)
         else
            t = y2(i)
         endif
         x(i) = t
      enddo
      return
      end

Compiled:

      subroutine sclrex(x,y1,y2,z)
      real x(100),y1(100),y2(100),z(100)
C###3 [fc] Loop on line 3 of scalep.f (DO I) fully vectorized%%%
      do i=1,100
         if(z(i).gt.0.0) then
            t = y1(i)
         else
            t = y2(i)
         endif
         x(i) = t
      enddo
      return
      end


• Scalar Expansion

_sclrex_:
        sub.w   #0x0000028,sp
        sub.w   #0x0000190,sp
        ldea    -440(fp),a1
        st.w    a1,-40(fp)
        ld.w    #0x0000064,vl
        ld.w    #0x0000004,vs
        ld.w    12(ap),a2
        ld.w    4(ap),a3
        ld.w    8(ap),a5
        ld.w    -40(fp),a4
        ld.w    0(ap),a1
        ld.w    #0x0000000,s0
        ld.w    0(a2),v0
        lt.s    s0,v0
        st.x    vm,-20(fp)
        ld.w    0(a3),v1
        ld.w    0(a5),v5
        ld.w    0(a4),v2
        sub.w   s1,s1
        mov     s1,vm,s2
        not     s2,s2
        mov     s1,s2,vm
        add.w   #0x0000001,s1
        mov     s1,vm,s2
        not     s2,s2
        mov     s1,s2,vm
        st.x    vm,-36(fp)
        ld.x    -20(fp),vm
        mask.t  v2,v1,v3
        st.w    v3,0(a4)
        ld.w    0(a4),v4
        ld.x    -36(fp),vm
        mask.t  v4,v5,v6
        st.w    v6,0(a4)
        ld.w    0(a4),v7
        st.w    v7,0(a1)
        add.w   #0x0000190,sp
        rtn                     ; #11


3 #1 
3 #3, ?push 


3 #4, Z 
3 #4, $ce_ xtemp0 
3 #4, $cg_xtemp0 


3 #3 

3 #3, $cg_xtempl 
3 #5, $ce_ xtemp0 
3; #5, $cg_xtemp0 
3 #5, v_1 

3 #7, v_1 

3 #7, $cg_xtempl 
3 #7, $cg_xtempl1 
3 #7, v_1 


• Vectorization of Conditionals

Source:

      subroutine cond (a,b,c)
      real*8 a(100), b(100), c(100)
      integer*4 i, j
      do 10 i = 1, 100
         if ( a(i) .gt. 0.0 ) then
            c(i) = a(i) * b(i)
         else
            c(i) = b(i)
         endif
   10 continue
      do 20 j = 1, 100
         if ( a(j) .gt. 2.0 ) goto 15
         c(j) = a(j) + b(j)
         go to 20
   15    continue
         c(j) = a(j) - b(j)
   20 continue
      return
      end


• Vectorization of Conditionals

Source:

      subroutine cond (a,b,c)
      real*8 a(100), b(100), c(100)
      integer*4 i, j
C###4 [fc] Loop on line 4 of cond.f (DO I) fully vectorized%%%
      do 10 i = 1, 100
         if ( a(i) .gt. 0.0 ) then
            c(i) = a(i) * b(i)
         else
            c(i) = b(i)
         endif
   10 continue
C###11 [fc] Loop on line 11 of cond.f (DO J) fully vectorized%%%
      do 20 j = 1, 100
         if ( a(j) .gt. 2.0 ) goto 15
         c(j) = a(j) + b(j)
         go to 20
   15    continue
         c(j) = a(j) - b(j)
   20 continue
      return
      end


• Vectorization of Conditionals

Code:

      DO I = 1,100
         IF (ISWTCH(I).GE.0) THEN
            X(I) = Y(I)*Z(I)
         ELSE
            X(I) = Y(I)-Z(I)
         ENDIF
      ENDDO

Both clauses of the IF are computed, and
the results are masked together.
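
The "compute both, then mask" scheme described above can be sketched in C as follows. The function names are hypothetical and the sketch is scalar code written in the shape a vector unit would execute: a mask is formed from the test, both arms are evaluated for every element, and the mask selects which result is stored.

```c
#include <stddef.h>

/* Branching form, one decision per element. */
void cond_branchy(float *x, const float *y, const float *z,
                  const int *iswtch, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (iswtch[i] >= 0)
            x[i] = y[i] * z[i];
        else
            x[i] = y[i] - z[i];
    }
}

/* Masked form: both clauses are computed, then merged under the mask. */
void cond_masked(float *x, const float *y, const float *z,
                 const int *iswtch, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int mask = (iswtch[i] >= 0);   /* element of the vector mask */
        float prod = y[i] * z[i];      /* THEN clause, always computed */
        float diff = y[i] - z[i];      /* ELSE clause, always computed */
        x[i] = mask ? prod : diff;     /* merge under mask */
    }
}
```

The masked form does extra arithmetic, so it pays off only when both clauses are cheap and free of side effects, as they are here.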


• Vectorization of Conditionals

Source:

      SUBROUTINE COND(X,Y,Z,ISWTCH,I)
      REAL X(100),Y(100),Z(100)
      INTEGER I,ISWTCH(100)
      DO I = 1,100
         IF(ISWTCH(I).GE.0) THEN
            X(I) = Y(I)*Z(I)
         ELSE
            X(I) = Y(I)-Z(I)
         ENDIF
      ENDDO
      RETURN
      END

Compiled:

      SUBROUTINE COND(X,Y,Z,ISWTCH,I)
      REAL X(100),Y(100),Z(100)
      INTEGER I,ISWTCH(100)
C###4 [fc] Loop on line 4 of t1.f (DO I) fully vectorized%%%
      DO I = 1,100
         IF(ISWTCH(I).GE.0) THEN
            X(I) = Y(I)*Z(I)
         ELSE
            X(I) = Y(I)-Z(I)
         ENDIF
      ENDDO
      RETURN
      END


• Vectorization of Conditionals

_cond_:
        sub.w   #0x0000020,sp
        ld.w    #0x0000001,s0
        st.w    s0,@16(ap)
        ld.w    #0x0000001,s1
        st.w    s1,@16(ap)
        ld.w    #0x0000064,vl
        ld.w    #0x0000004,vs
        ld.w    12(ap),a1
        ld.w    4(ap),a3
        ld.w    8(ap),a2
        ld.w    0(ap),a4
        ld.w    #0x0000000,s5
        ld.w    #0x40800000,s6
        ld.w    #0x0000000,s2
        ld.w    0(a1),v0
        le.w    s2,v0
        st.x    vm,-16(fp)
        ld.w    0(a3),v3
        ld.w    0(a2),v1
        sub.w   s3,s3
        mov     s3,vm,s4
        not     s4,s4
        mov     s3,s4,vm
        add.w   #0x0000001,s3
        mov     s3,vm,s4
        not     s4,s4
        mov     s3,s4,vm
        st.x    vm,-32(fp)
        ld.x    -16(fp),vm
        mask.f  v1,s6,v5
        mul.s   v3,v5,v6
        ld.x    -32(fp),vm
        mask.f  v1,s5,v2
        sub.s   v3,v2,v4
        mask.t  v6,v4,v7
        st.w    v7,0(a4)
        ld.w    #0x0000065,s7
        st.w    s7,@16(ap)
        rtn

3 #11 


3; #1 

3 #4 

3 #4, I 

3 #4 

3; #4, I 

3 #5 

3 #6 

3; #5, ISWTICH 
3; #6, Y 


3; #5, ISWTCH 
3 #5, $ce2_ xtemp0 
; #5, $ce_ xtemp0 


3 #4, $cg_xtempl 
; #6, $cg_ xtemp0 
; #6, $cg_xtemp0 
3; #6 

; #8, $cg xtempl 
3 #8, $cg_ xtempl 
3; #8 

; #8, $cg_ xtempl 
3 #8, xX 

3 #4 

3 #4, I 


• Partial Vectorization

Source:

      do i = 1, n
         a(i) = a(i-1) + b(i) * c(i)
      enddo

Becomes:

      do i = 1, n
         t(i) = b(i) * c(i)
      enddo

      do i = 1, n
         a(i) = a(i-1) + t(i)
      enddo
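
The split above separates the part of the loop with no cross-iteration dependence from the part that carries the recurrence. A minimal C sketch follows; the function names and the caller-supplied scratch array `t` are assumptions made for illustration, and indices run from 1 to `n - 1` so that `a[0]` plays the role of the seed value `a(i-1)` on the first trip.

```c
#include <stddef.h>

/* Original loop: the a[i-1] term makes every iteration depend on the last. */
void recur_direct(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i] * c[i];
}

/* Split form: the products are independent and could be done as one
 * vector multiply; only the running sum remains a scalar recurrence. */
void recur_split(double *a, const double *b, const double *c,
                 double *t, size_t n)
{
    for (size_t i = 1; i < n; i++)      /* vectorizable part */
        t[i] = b[i] * c[i];

    for (size_t i = 1; i < n; i++)      /* recurrence, stays scalar */
        a[i] = a[i - 1] + t[i];
}
```

This is why the compiler reports the loop as only partially vectorized: the multiply can be done with vector instructions while the recurrence on `a` runs in scalar code.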


• Partial Vectorization

Source:

      subroutine part (a,b,c,n)
      real*8 a(100), b(100), c(100)
      integer*4 n
      do i = 1, n
         a(i) = a(i-1) + b(i) * c(i)
      enddo
      return
      end

Compiled:

      subroutine part (a,b,c,n)
      real*8 a(100), b(100), c(100)
      integer*4 n
C###4 [fc] Loop on line 4 of part.f (DO I) partially vectorized%%%
C###4 [fc] Loop on line 4 of part.f (DO I) The assignment to A
C     on line 5 appears to be in a recurrence%%%
      do i = 1, n
         a(i) = a(i-1) + b(i) * c(i)
      enddo
      return
      end


- Partial Vectorization 


_part_: 


Ms: 


sub.w 
Id.w 
le.w 
jbrs.f 
Id.w 
Id.w 
Idea 
st.w 
st.w 
shf 
neg.w 
sub.w 
add.w 
st.w 
st.w 
st.w 


Id.w 
Id.w 


#0x0000030,sp 
@ 12(ap),sO 


~ #0x0000001,s0 — 


M2 
#0x0000000,s1 
@12(ap),al 
-48(fp),a3 
s1,-36(fp) 
al,-16(fp) 
#0x0000003,al 
al,a2 

al,sp 

a2,a3 
al,-12(fp) 
al,-40(fp) 
a3,-48(fp) 


-16(fp),a2 
#0x0000008,vs 


- Partial Vectorization 


M4: 


Id.w 
Id.w 
mov 
mov 
mov 
add.w 
add.w 
Id.1 
Id.w 
add.w 
st.w 
Id.w 
add.w 
Id.1 
mul.d 
add.w 
st.] 
It.w 
jbra.t 
st.w 
Id.w 
Id.w 
add.w 
st.w 
Id.w 


-36(fp),a4 
4(ap),al 
a2,a3 

a4,a5 

a3,vl 

# Ox ffffff80,a2 
ad,al 
O(al1),v0 
-48(fp),al 
#0x0000400,a4 
a4,-36(fp) 
8(ap),a4 
ad,a4 
O(a4),v1 
v0,v1,v2 
al,a5 
v2,0(a5) 


- ¢0x0000000,a2 


M4 
a2,-16(fp) 

# Oxffffff8 ,s3 
-40(fp),s2 

# Ox fffffff0,s2 
s3,-24(fp) 
-24(fp),a3 


M6: 


Id.w 
ld.w 
mov 
mov 
add.w 
add.w 
add.w 
Id.1 
add.w 
Id.] 
ld.w 
add.w 
add.d 
le.w 
st.l 
jbra.t 
st.w 
ld.w 
add.w 


rtn 


- Partial Veétor mation 


-48(fp),a5 
O(ap),a2 

a3,a4 

a4,al 
#0x0000008,a3 
#0x0000008,a1 
a2,a4 

0(a4),s5 

al,ad 

0(a5),s4 
-32(fp),a5 - 
a2,al 

s4,s95 


* a3,a5 


s5,0(a1) 
M6 
a3,-24(fp) 
-12(fp),a4 
a4,sp 

3; Stmt -2 
3 #7 


• Detect Loop Iteration Count

Source:

      subroutine count(a,b,c)
      real*8 a(10), b(10), c(10)
      integer*4 i, j
      do 10 i = 1, 2
         a(i) = b(i) + c(i)
   10 continue
      do 20 j = 1, 3
         a(j) = b(j) + c(j)
   20 continue
      return
      end

Compiled:

      subroutine count(a,b,c)
      real*8 a(10), b(10), c(10)
      integer*4 i, j
C###5 [fc] Loop on line 5 of count.f (DO I) not vectorized%%%
C###5 [fc] Loop on line 5 of count.f (DO I) executed fewer
C     than 3 times%%%
      do 10 i = 1, 2
         a(i) = b(i) + c(i)
   10 continue
C###8 [fc] Loop on line 8 of count.f (DO J) fully vectorized%%%
      do 20 j = 1, 3
         a(j) = b(j) + c(j)
   20 continue
      return
      end


• Accept All Data Types

Source:

      subroutine all(ai1,ai2,ai4,ai8,ar4,ar8,ac8,ac16,
     x     al1,al2,al4,al8,bi1,bi2,bi4,bi8,br4,
     x     br8,bc8,bc16,bl1,bl2,bl4,bl8,ci1,ci2,
     x     ci4,ci8,cr4,cr8,cc8,cc16,cl1,cl2,cl4,cl8)

      integer*1 ai1(100), bi1(100), ci1(100)
      integer*2 ai2(100), bi2(100), ci2(100)
      integer*4 ai4(100), bi4(100), ci4(100)
      integer*8 ai8(100), bi8(100), ci8(100)
      real*4 ar4(100), br4(100), cr4(100)
      real*8 ar8(100), br8(100), cr8(100)
      complex*8 ac8(100), bc8(100), cc8(100)
      complex*16 ac16(100), bc16(100), cc16(100)
      logical*1 al1(100), bl1(100), cl1(100)
      logical*2 al2(100), bl2(100), cl2(100)
      logical*4 al4(100), bl4(100), cl4(100)
      logical*8 al8(100), bl8(100), cl8(100)

      integer*4 j

      do 20 j = 1, 100
         ai1(j) = bi1(j) + ci1(j)
         ai2(j) = bi2(j) * ci2(j)
         ai4(j) = bi4(j) + ci4(j)
         ai8(j) = bi8(j) * ci8(j)
         ar4(j) = br4(j) + cr4(j)
         ar8(j) = br8(j) * cr8(j)
         ac8(j) = bc8(j) + cc8(j)
         ac16(j) = bc16(j) * cc16(j)
         al1(j) = bl1(j) .and. cl1(j)
         al2(j) = bl2(j) .or. cl2(j)
         al4(j) = bl4(j) .and. cl4(j)
         al8(j) = bl8(j) .or. cl8(j)
   20 continue
      return
      end


• Accept All Data Types

Compiled:

      subroutine all(ai1,ai2,ai4,ai8,ar4,ar8,ac8,ac16,
     x     al1,al2,al4,al8,bi1,bi2,bi4,bi8,br4,
     x     br8,bc8,bc16,bl1,bl2,bl4,bl8,ci1,ci2,
     x     ci4,ci8,cr4,cr8,cc8,cc16,cl1,cl2,cl4,cl8)

      integer*1 ai1(100), bi1(100), ci1(100)
      integer*2 ai2(100), bi2(100), ci2(100)
      integer*4 ai4(100), bi4(100), ci4(100)
      integer*8 ai8(100), bi8(100), ci8(100)
      real*4 ar4(100), br4(100), cr4(100)
      real*8 ar8(100), br8(100), cr8(100)
      complex*8 ac8(100), bc8(100), cc8(100)
      complex*16 ac16(100), bc16(100), cc16(100)
      logical*1 al1(100), bl1(100), cl1(100)
      logical*2 al2(100), bl2(100), cl2(100)
      logical*4 al4(100), bl4(100), cl4(100)
      logical*8 al8(100), bl8(100), cl8(100)
C
      integer*4 j
C
C###24 [fc] Loop on line 24 of all.f (DO J) fully vectorized%%%
      do 20 j = 1, 100
         ai1(j) = bi1(j) + ci1(j)
         ai2(j) = bi2(j) * ci2(j)
         ai4(j) = bi4(j) + ci4(j)
         ai8(j) = bi8(j) * ci8(j)
         ar4(j) = br4(j) + cr4(j)
         ar8(j) = br8(j) * cr8(j)
         ac8(j) = bc8(j) + cc8(j)
         ac16(j) = bc16(j) * cc16(j)
         al1(j) = bl1(j) .and. cl1(j)
         al2(j) = bl2(j) .or. cl2(j)
         al4(j) = bl4(j) .and. cl4(j)
         al8(j) = bl8(j) .or. cl8(j)
   20 continue
      return
      end


VECTORIZATION LIMITATIONS

Loops containing
• Character data or operations
• Computed/assigned goto's
• Function/subroutine references
• I/O statements
• Multiple exits
• Loops whose DO parameter varies with
  respect to the outer loop
• Equivalenced variables
• Recurrences
• Certain IF constructs


• Character Variables

Source:

      program cmp
      character*100 str1,str2
      logical ieq
      ieq=.true.
      do i=1,100
         if(str1(i:i).ne.str2(i:i)) ieq=.false.
      enddo
      end

Compiled:

      program cmp
      character*100 str1,str2
      logical ieq
      ieq=.true.
C###5 [fc] Loop on line 5 of cmp.f (DO I) not vectorized%%%
C###5 [fc] Loop on line 5 of cmp.f (DO I) contains character
C     expressions%%%
      do i=1,100
         if(str1(i:i).ne.str2(i:i)) ieq=.false.
      enddo
      end


• GO TO Statements

Source:

      subroutine gtos(a,b,imx,idir)
      integer*4 j, imx, idir
      real*8 a(imx), b(imx)
      do 100 j = 1, imx
         goto (20,40,60) idir
         a(j) = 4.0 * b(j)
         go to 60
   20    continue
         a(j) = b(j)
         go to 100
   40    continue
         a(j) = -b(j)
         go to 100
   60    continue
  100 continue
      return
      end


• GO TO Statements

Compiled:

      subroutine gtos(a,b,imx,idir)
      integer*4 j, imx, idir
      real*8 a(imx), b(imx)
C###4 [fc] Loop on line 4 of f1.f (DO J) not vectorized%%%
C###4 [fc] Loop on line 4 of f1.f (DO J) contains a computed goto%%%
      do 100 j = 1, imx
         goto (20,40,60) idir
         a(j) = 4.0 * b(j)
         go to 60
   20    continue
         a(j) = b(j)
         go to 100
   40    continue
         a(j) = -b(j)
         go to 100
   60    continue
  100 continue
      return
      end


• GO TO Statements

Reworked Code:

      subroutine gtos(a,b,imx,idir)
      integer*4 j, imx, idir
      real*8 a(imx), b(imx)
      goto (10,30,50) idir
   10 continue
      do 20 j = 1, imx
         a(j) = 4.0 * b(j)
   20 continue
      return
   30 continue
      do 40 j = 1, imx
         a(j) = b(j)
   40 continue
      return
   50 continue
      do 60 j = 1, imx
         a(j) = -b(j)
   60 continue
      return
      end


• GO TO Statements

Compiled:

      subroutine gtos(a,b,imx,idir)
      integer*4 j, imx, idir
      real*8 a(imx), b(imx)
      goto (10,30,50) idir
   10 continue
C###6 [fc] Loop on line 6 of f2.f (DO J) fully vectorized%%%
      do 20 j = 1, imx
         a(j) = 4.0 * b(j)
   20 continue
      return
   30 continue
C###11 [fc] Loop on line 11 of f2.f (DO J) fully vectorized%%%
      do 40 j = 1, imx
         a(j) = b(j)
   40 continue
      return
   50 continue
C###16 [fc] Loop on line 16 of f2.f (DO J) fully vectorized%%%
      do 60 j = 1, imx
         a(j) = -b(j)
   60 continue
      return
      end


- Subroutine Calls 


Source: 


program vsub 
real*8 a(100), b(100), c(100)
do 10 i = 1, 100
a(i) = 1.0
b(i) = 2.0
10 continue
do 20 j = 1, 100
c(j) = 0.0 
call armult(c(j),a,b) 
20 continue 
stop 
end 


Compiled: 


program vsub 
real * 8 a(100), b(100), c(100) 
C###3 [fc] Loop on line 3 of vsub.f (DO I) fully vectorized%%%
do 10 i = 1, 100
a(i) = 1.0
b(i) = 2.0
10 continue
C###8 [fc] Loop on line 8 of vsub.f (DO J) contains a
C     subroutine or function call%%%
C###8 [fc] Loop on line 8 of vsub.f (DO J) not vectorized%%%
do 20 j = 1, 100 
c(j) = 0.0 
call armult(c(j),a,b) 
20 continue 
stop 
end 


• Input/Output


Source: 


program ios 
real*8 a(100), b(100) 
integer*4 i 
do 10 i = 1,100
a(i) = 1.0
b(i) = a(i) * 3.0
print 100, b(i)
10 continue
100 format(3x,'value of b(i):',f6.3)
stop 
end 


Compiled: 


program ios 
real*8 a(100), b(100) 
integer*4 i 
C###4 [fc] Loop on line 4 of ios.f (DO I) not vectorized%%%
C###4 [fc] Loop on line 4 of ios.f (DO I) performs I/O%%%
do 10 i = 1, 100
a(i) = 1.0
b(i) = a(i) * 3.0
print 100, b(i)
10 continue
100 format(3x,'value of b(i):',f6.3)
stop 
end 
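
One way to keep the arithmetic eligible for vectorization is to move the PRINT into a loop of its own. The sketch below is a hand-distributed variant of `ios`; the program name `ios2` and the free-form layout are assumptions made for illustration, and whether a particular compiler vectorizes the first loop still has to be confirmed from its messages.

```fortran
program ios2
  implicit none
  real(kind=8) :: a(100), b(100)
  integer :: i

  ! Arithmetic only: no I/O in this loop, so it is a vectorization candidate.
  do i = 1, 100
     a(i) = 1.0d0
     b(i) = a(i) * 3.0d0
  end do

  ! The I/O is isolated in its own loop, which stays scalar.
  do i = 1, 100
     print 100, b(i)
  end do
100 format(3x,'value of b(i):',f6.3)
end program ios2
```

The output is the same as in the original loop because each `b(i)` depends only on `a(i)`.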


« Multiple Exits 


Terminates prematurely: 


do i = 1, 100 
if(x(i) .lt. 0.0) go to 100
nelem = i
x(i) = x(i) / 2.0 
enddo 
100 continue 


Abnormal conditions: 


do i = 1, 100
if(x(i) .gt. 1e19) go to 900
x(i) = x(i) ** 2
enddo


900 print *,' error, x out of range'




« Multiple Exits 


Terminates prematurely: 


C###4 [fc] Loop on line 4 of trm.f (DO I) not vectorized%%%
C###4 [fc] Loop on line 4 of trm.f (DO I) or a contained loop
C     has multiple exits%%%
do i = 1, 100
if(x(i) .lt. 0.0) go to 100
nelem = i
x(i) = x(i) / 2.0
enddo 
100 continue 


Abnormal conditions: 


C###4 [fc] Loop on line 4 of abn.f (DO I) not vectorized%%%
C###4 [fc] Loop on line 4 of abn.f (DO I) or a contained loop
C     has multiple exits%%%
do i = 1, 100
if(x(i) .gt. 1e19) go to 900
x(i) = x(i) ** 2
enddo


900 print *,' error, x out of range'
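
A common way to restructure the early-termination case is to find the exit point first and then run a loop with a single exit over the elements that qualify. The following is a minimal sketch, not taken from the slides: the program name `trm2`, the test data, and the reference copy `xref` are all hypothetical, and the search loop itself still contains an exit and would remain scalar.

```fortran
program trm2
  implicit none
  real :: x(100), xref(100)
  integer :: i, nelem, nref

  ! Hypothetical test data: the first negative value is at element 41.
  do i = 1, 100
     x(i) = 40.5 - real(i)
  end do
  xref = x

  ! Reference: the original form, with an exit inside the working loop.
  nref = 0
  do i = 1, 100
     if (xref(i) < 0.0) exit
     nref = i
     xref(i) = xref(i) / 2.0
  end do

  ! Restructured: a short search loop determines the trip count ...
  nelem = 100
  do i = 1, 100
     if (x(i) < 0.0) then
        nelem = i - 1
        exit
     end if
  end do

  ! ... and the arithmetic loop has a single exit.
  do i = 1, nelem
     x(i) = x(i) / 2.0
  end do

  ! Both forms must agree.
  if (nelem /= nref .or. any(x /= xref)) stop 1
  print *, 'nelem =', nelem
end program trm2
```

The search loop does no arithmetic on the array, so the cost of leaving it scalar is small compared with the work loop.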


« Inner Loops With Varying Parameters 


Source: 


subroutine nested(x,y) 
real x(100,100),y(100) 
do j=1,100 

y(j) = y(j)**2 

do i=1,j 

x(j,i) = y(j)

enddo 
enddo 
return 
end 


Compiled:


subroutine nested(x,y) 

real x(100,100),y(100) 
C###3 [fc] Loop on line 3 of nested.f (DO J) not vectorized%%% 
C###3 [fc] Loop on line 3 of nested.f (DO J) An induction variable
C     of a contained loop has a starting value or stride that
C     appears to vary with each iteration%%%
do j=1,100 


y(j) = y(j)**2 
C###5 [fc] Loop on line 5 of nested.f (DO I) fully vectorized%%% 
do i=1,j 
x(j,i) = y(j) 

enddo 

enddo 

return 

end 


» Loops With Equivalenced Variables 


Source: 


program equiv1
integer*4 i1(100),i2(100)
integer*2 i1b(200)
equivalence (i1(1),i1b(1))
do i=1,100
i1b((i*2)-1) = 0 !set upper two bytes to 0
i2(i) = i2(i) + i1(i) !and add to i2
enddo
end


Compiled:


program equiv1
integer*4 i1(100),i2(100)
integer*2 i1b(200)
equivalence (i1(1),i1b(1))
C###5 [fc] Loop on line 5 of equiv1.f (DO I) not vectorized%%%
C###5 [fc] Loop on line 5 of equiv1.f (DO I) An equivalenced variable
C     or array inhibits vectorization%%%
do i=1,100
i1b((i*2)-1) = 0 !set upper two bytes to 0
i2(i) = i2(i) + i1(i) !and add to i2
enddo
end


- Loops With Equivalenced Variables 


Source: 


program equiv1
integer*4 i1(100),i2(100)
integer*2 i1b(200)
equivalence (i1(1),i1b(1))
C###5 [fc] Loop on line 5 of equiv1.f (DO I) not vectorized%%%
C###5 [fc] Loop on line 5 of equiv1.f (DO I) An equivalenced variable
C     or array inhibits vectorization%%%
do i=1,100
i1b((i*2)-1) = 0 !set upper two bytes to 0
i2(i) = i2(i) + i1(i) !and add to i2
enddo
end

After: 


program equiv2
integer*4 i1(100),i2(100)
integer*2 i1b(200)
equivalence (i1(1),i1b(1))
C###5 [fc] Loop on line 5 of equiv2.f (DO I) not vectorized%%%
C###5 [fc] Loop on line 5 of equiv2.f (DO I) An equivalenced variable
C     or array inhibits vectorization%%%
do i=1,100
i1b((i*2)-1) = 0 !set upper two bytes to 0
enddo
C###8 [fc] Loop on line 8 of equiv2.f (DO I) fully vectorized%%%
do i=1,100
i2(i) = i2(i) + i1(i) !and add to i2
enddo
end


- Recursion 


Scalar Processing


• Allows dependencies such as calculating
the value of an element and immediately
using that result to calculate the
value of the next element in the vector.


Vector Processing


• All elements of a vector are treated
exactly the same and in a group.


• Vector elements must be independent
of one another.


• The value of one element may not be
dependent on the result of a
calculation involving any other
element in the group.


e Recursion 


Recursion can occur two ways:


• Result Not Ready

The operations in the current iteration depend
on results from a previous iteration. Doing
the operations in parallel (vectorization)
requires that the calculations for an
iteration depend only on "old" values for a
vector.


• Value No Longer Available

Old values of a variable are modified as
the current value is being used in calculations.
Doing the operations in parallel implies
that previous values and the current value
are modified simultaneously.


- Recursion 


Dependency analysis looks at pairs of
uses of a variable and identifies three
types of recursion:


• Conflict between usage-assignment pairs


do i=1,100
x(i) = x(i-1) + y(i)
enddo


• Conflict between assignment-assignment
pairs


do i=1,100
x(i+j) = y(i)
x(i+k) = z(i)
enddo


• Conflict between assignment-usage pairs


do i=1,100
x(i-1) = y(i)
z(i) = x(i)
enddo
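
The difference between scalar and vector semantics for the usage-assignment case can be seen with a small experiment. The sketch below is an illustration only: it uses a Fortran array assignment (whose right-hand side is evaluated entirely from "old" values before anything is stored) as a stand-in for vector execution, and the names `recdemo`, `xs`, and `xv` are hypothetical.

```fortran
program recdemo
  implicit none
  integer, parameter :: n = 5
  real :: xs(n), xv(n), y(n)
  integer :: i

  y  = 1.0
  xs = 0.0
  xv = 0.0

  ! Scalar semantics: each iteration sees the value just computed.
  ! Result: 0, 1, 2, 3, 4
  do i = 2, n
     xs(i) = xs(i-1) + y(i)
  end do

  ! Vector-style semantics: all operands are fetched before any store.
  ! Result: 0, 1, 1, 1, 1
  xv(2:n) = xv(1:n-1) + y(2:n)

  print *, 'scalar:', xs
  print *, 'vector:', xv
end program recdemo
```

Because the two results differ, a compiler that must preserve scalar semantics cannot vectorize this loop as written.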


- Recursion 


Usage-Assignment Recursion


do i=2,100
x(i) = x(i-1) + y(i)
enddo


Assignment-Usage Recursion


do i=2,100
x(i-1) = y(i)
z(i) = x(i)
enddo


Assignment-Assignment Recursion


do i=1,98
x(i) = y(i)
x(i+2) = z(i)
enddo




- Recursion 


Source: 


subroutine recur (a,n) 

real*8 a(100) 

integer*4 n 

do i=1,n

a(i) = a(i-1) + 5
enddo 

return 

end 


Compiled: 


subroutine recur (a,n) 

real*8 a(100) 

integer*4 n 
C###4 [fc] Loop on line 4 of recur.f (DO I) not vectorized%%%
C###4 [fc] Loop on line 4 of recur.f (DO I) The assignment to A
C     on line 5 appears to be in a recurrence%%%
do i=1,n

a(i) = a(i-1) + 5 
enddo 

return 

end 


e Recurrence 


Code:


do 10 j = 1,m
a(j) = a(j-1) + b(j) * c(j)
10 continue


Rewrite:


do 10 j = 1,m
t(j) = b(j) * c(j)
10 continue


do 20 j = 1,m
a(j) = a(j-1) + t(j)
20 continue


EXCEPTIONS 


e Store Reversal 
Source: 


subroutine recur2(x,y,z)
real x(100),y(100),z(100)
do i=2,100
x(i-1) = y(i)
z(i) = x(i)
enddo 
return 
end 


Compiled: 


subroutine recur2(x,y,z) 
real x(100),y(100),z(100)
C###3 [fc] Loop on line 3 of t2.f (DO I) fully vectorized%%%

do i=2,100 
x(i-1) = y(i) 
z(i) = x(i) 

enddo 

return 

end 


« Store Reversal 


; INSTRUCTIONS
_recur2_:
ld.w #0x0000063,vl
ld.w #0x0000004,vs
ld.w 4(ap),a2
ld.w 0(ap),a1
ld.w 8(ap),a3
ld.w 4(a2),v1 ; #5, Y
ld.w 4(a1),v0 ; #5, X
st.w v0,4(a3)
st.w v1,0(a1)
rtn ; #8


- Set Overlap Analysis


Source: 


subroutine recur3(x,y) 
dimension x(100),y(100) 
do i = 1,99,2
x(i) = y(i)
x(i+1) = -y(i+1)
enddo
return 
end 


Compiled: 


subroutine recur3(x,y) 
dimension x(100),y(100) 
C###3 [fc] Loop on line 3 of t2.f (DO I) fully vectorized%%%
do i = 1,99,2 
x(i) = y(i) 
x(i+1) = -y(i+1) 
enddo 
return 
end 


e Set Overlap Analysis 


; INSTRUCTIONS
_recur3_:
ld.w #0x0000032,vl
ld.w #0x0000008,vs
ld.w 4(ap),a1
ld.w 0(ap),a2
ld.w 4(a1),v1
neg.s v1,v2
st.w v2,4(a2)
ld.w 0(a1),v0
st.w v0,0(a2)
rtn ; #7


e Sum Reduction 


Source: 


subroutine sumred(xsum,x) 


real x(100),xsum 
xsum = 0.0 
do i=1,100
xsum = xsum + x(i)


enddo 
return 
end 

; INSTRUCTIONS
_sumred_:
ld.w #0x0000000,s0
st.w s0,@0(ap)
ld.w #0x0000064,vl
ld.w #0x0000004,vs
ld.w 4(ap),a1
ld.w @0(ap),s1
ld.w 0(a1),v0
mov.w s1,s0
sum.s v0
st.w s0,@0(ap)
rtn ; #7
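
A sum is formally a recurrence on `xsum`, yet it can be reordered because addition can be regrouped. The sketch below illustrates that idea with independent partial sums; it is not a description of how the `sum.s` instruction works internally, and the names `sumdemo`, `nlanes`, and `part` are hypothetical. The data are small integers so that every intermediate value is exact; with general floating-point data, a reordered sum may differ in rounding.

```fortran
program sumdemo
  implicit none
  integer, parameter :: n = 100, nlanes = 4
  real :: x(n), xsum, part(nlanes)
  integer :: i, k

  do i = 1, n
     x(i) = real(i)
  end do

  ! Scalar form, as in sumred: one running total.
  xsum = 0.0
  do i = 1, n
     xsum = xsum + x(i)
  end do

  ! Partial sums: each lane accumulates independently of the others,
  ! and the lanes are combined once at the end.
  part = 0.0
  do i = 1, n, nlanes
     do k = 1, nlanes
        part(k) = part(k) + x(i+k-1)
     end do
  end do

  ! Both orderings must give the same total for this exact data.
  if (xsum /= sum(part)) stop 1
  print *, xsum, sum(part)
end program sumdemo
```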


EXTENSIONS 


- Compiler Directives 


NO_SIDE_EFFECTS 
SCALAR 


NO_RECURRENCE 


OPPORTUNITIES 


The major difference between 
the mathematical description of an 
algorithm and the program to 
execute it is the description and 
manipulation of the data structures. 


Three performance levels


e Scalar 
e Vector 


e Super Vector 


Memory Access Considerations 


e Memory operations on consecutive memory 
locations (the first subscript of a 
FORTRAN array) are handled by the cache
bypass hardware, loading one double 
precision or two single precision 
elements every clock cycle. 


e Memory operations on non-consecutive 
memory locations are handled by the 
normal address-translation and cache 
hardware, and may take many cycles 
per element. 
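
In practice this means arranging loop nests so that the innermost loop runs over the first subscript. The sketch below contrasts the two loop orders on a small array; the program name `stride` and the array size are assumptions for illustration, and the program checks only that both orders compute the same values, not how fast they run.

```fortran
program stride
  implicit none
  integer, parameter :: n = 200
  real :: a(n,n), b(n,n)
  integer :: i, j

  ! Inner loop over the first subscript: consecutive memory locations.
  do j = 1, n
     do i = 1, n
        a(i,j) = real(i + j)
     end do
  end do

  ! Same values, but the inner loop runs over the second subscript,
  ! so successive accesses are n elements apart.
  do i = 1, n
     do j = 1, n
        b(i,j) = real(i + j)
     end do
  end do

  if (any(a /= b)) stop 1
  print *, 'both loop orders give identical results'
end program stride
```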


- Memory Access Considerations 


Source: 


program memacc 
real*4 x(1000000),z(1000000) 


time1 = extime(dummy)
do i = 1,1000000,1
x(i) = 0.0
z(i) = 1.0
enddo
time2 = extime(dummy)
write(6,1000) time2-time1


do istr = 1,10
time1 = extime(dummy)
c$dir scalar
do k = 1, istr
do i = 1,1000000,istr
x(i) = z(i)
enddo
enddo
time2 = extime(dummy)
write(6,1010) istr,time2-time1
enddo


stop
1000 format(' Initialization    : ',f6.4,' sec.')
1010 format(' Copy with stride ',i2,': ',f6.4,' sec.')
end 


Memory Access Considerations 
Compiled: 


program memacc 
real*4 x(1000000),z(1000000) 
time1 = extime(dummy)
C###5 [fc] Loop on line 5 of memacc.f (DO I) fully vectorized%%%
do i = 1,1000000,1
x(i) = 0.0
z(i) = 1.0
enddo
time2 = extime(dummy)
write(6,1000) time2-time1
C###12 [fc] Loop on line 12 of memacc.f (DO ISTR) not vectorized%%%
C###12 [fc] Loop on line 12 of memacc.f (DO ISTR)
C     An induction variable of a contained loop has a
C     starting value or stride that appears to vary
C     with each iteration%%%
do istr = 1,10
time1 = extime(dummy)
c$dir scalar
C###15 [fc] Loop on line 15 of memacc.f (DO K) vectorization
C     inhibited by SCALAR directive%%%
do k = 1, istr
C###16 [fc] Loop on line 16 of memacc.f (DO I) fully vectorized%%%
do i = 1,1000000,istr
x(i) = z(i)
enddo
enddo
time2 = extime(dummy)
write(6,1010) istr,time2-time1
enddo


stop
1000 format(' Initialization    : ',f6.4,' sec.')
1010 format(' Copy with stride ',i2,': ',f6.4,' sec.')
end 


- Memory Access Considerations
Timings:


Initialization    : 0.2683 sec.
Copy with stride  1: 0.1511 sec.
Copy with stride  2: 0.8134 sec.
Copy with stride  3: 1.1837 sec.
Copy with stride  4: 1.5401 sec.
Copy with stride  5: 1.8226 sec.
Copy with stride  6: 2.1754 sec.
Copy with stride  7: 2.4533 sec.
Copy with stride  8: 2.6608 sec.
Copy with stride  9: 2.5925 sec.
Copy with stride 10: 2.6643 sec.


« Other Considerations 


Memory access time as a function 
of vector stride 


Vector spills and temps may overrun 
stack 


« Matrix Initialization 


Method 1:


do j = 1,n
do i = 1,n
if ( i .eq. j ) then
a(i,j) = c
else
a(i,j) = 0.0
end if
end do
end do


Method 2:


do j = 1,n
do i = 1,n
a(i,j) = 0.0
end do
a(j,j) = c
end do


e Matrix Initialization 


Method II


time1 = cputime ( 0.0 )
time2 = cputime ( time1 )
overhd = time2 - time1
time1 = cputime ( 0.0 )
c
C###37 [fc] Loop on line 37 of diag.f (DO J) (distributed loop #1)
C     fully vectorized%%%
C###37 [fc] Loop on line 37 of diag.f (DO J) (distributed loop #2)
C     fully vectorized%%%
C###37 [fc] Loop on line 37 of diag.f (DO J) distributed,
C     forming 2 loops%%%
do j = 1,n
do i = 1,n
a(i,j) = 0.0
end do
a(j,j) = c
end do
c
time2 = cputime ( time1 )
time = time2 - time1 - overhd
print *, ' Method II time: ', time, ' secs.'
c
stop
end
Timing:


Method I time: 0.6618650 secs.
Method II time: 8.4289074E-02 secs.
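
Before relying on the faster form, it is worth confirming that the two methods produce the same matrix. The following sketch does that for a small case; the program name `diag2`, the size `n = 64`, and the value `c = 3.0` are assumptions chosen for illustration.

```fortran
program diag2
  implicit none
  integer, parameter :: n = 64
  real, parameter :: c = 3.0
  real :: a1(n,n), a2(n,n)
  integer :: i, j

  ! Method 1: a test inside the inner loop.
  do j = 1, n
     do i = 1, n
        if (i == j) then
           a1(i,j) = c
        else
           a1(i,j) = 0.0
        end if
     end do
  end do

  ! Method 2: unconditional fill, then fix up the diagonal element.
  do j = 1, n
     do i = 1, n
        a2(i,j) = 0.0
     end do
     a2(j,j) = c
  end do

  if (any(a1 /= a2)) stop 1
  print *, 'methods agree'
end program diag2
```

Method 2 stores each diagonal element twice, which is harmless here because the second store always wins.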


« Matrix Multiplication 


Method 1:
do j = 1,n
do i = 1,n
sum = 0.0
do k = 1,n
sum = sum + a(j,k)*b(k,i)
end do
c(i,j) = sum
end do
end do
Method 2:
do j = 1,n
do i = 1,n
c(i,j) = 0.0
do k = 1,n
c(i,j) = c(i,j) + a(j,k)*b(k,i)
end do
end do
end do


Matrix Multiplication


program mult
c Two versions of a matrix multiply
c
real*4 a(10,10), b(10,10), c(10,10), sum
real*4 time,time1,time2,ovrhd,cputime
integer*4 i, j, n
data a / 100*1.0/ , b / 100*1.0/
data n/10/
c
time1 = cputime(0.0)
time2 = cputime(time1)
ovrhd = time2 - time1
time1 = cputime(0.0)
c
C###17 [fc] Loop on line 17 of mult.f (DO J) unable to
C     distribute loop%%%
C###17 [fc] Loop on line 17 of mult.f (DO J) not vectorized%%%
do j = 1,n
C###18 [fc] Loop on line 18 of mult.f (DO I) unable to
C     distribute loop%%%
C###18 [fc] Loop on line 18 of mult.f (DO I) not vectorized%%%
do i = 1,n
sum = 0.0
C###20 [fc] Loop on line 20 of mult.f (DO K) fully vectorized%%%
do k = 1,n
sum = sum + a(j,k)*b(k,i)
end do
c(i,j) = sum
end do
end do
c
time2 = cputime(time1)
time = time2 - time1 - ovrhd
print *, ' Method I time: ',time,' seconds.'


Matrix Multiplication


time1 = cputime(0.0)
time2 = cputime(time1)
ovrhd = time2 - time1
c
time1 = cputime(0.0)
c
C###37 [fc] Loop on line 37 of mult.f (DO J) distributed,
C     forming 2 loops%%%
C###37 [fc] Loop on line 37 of mult.f (DO J) (distributed loop #1)
C     fully vectorized%%%
C###37 [fc] Loop on line 37 of mult.f (DO J) interchanged to be
C     innermost loop of nest%%%
C###37 [fc] Loop on line 37 of mult.f (DO J) (distributed loop #2)
C     fully vectorized%%%
do j = 1,n
do i = 1,n
c(i,j) = 0.0
do k = 1,n
c(i,j) = c(i,j) + a(j,k)*b(k,i)
end do
end do
end do
c
time2 = cputime(time1)
time = time2 - time1 - ovrhd
print *, ' Method II time: ',time,' seconds.'
c
stop
end
Timings:


Method I time: 1.6579996E-03 seconds.
Method II time: 9.5599983E-04 seconds.


« Matrix Square Root 


Method 1:


do j = 1,n
do i = j,n
g(i,j) = l(i,j) * sqrt(d(j))
end do
end do


Method 2:


do j = 1,n
do i = 1,n
g(i,j) = l(i,j) * sqrt(d(j))
end do
end do


Method 3:


do i = 1,n
d(i) = sqrt( d(i) )
end do
do j = 1,n
do i = 1,n
g(i,j) = l(i,j) * d(j)
end do
end do


« Matrix Square Root 


program square_root 


c
c Square root of a symmetric, square matrix
c
c by the LDL decomposition
c
real*4 l(512,512), g(512,512), d(512), time1, time2,
& time, overhd
integer*4 n
data n / 512 /, l / 262144 * 0.0 /, g / 262144 * 0.0 /
c
c Put some stuff in l & d
c
C###12 [fc] Loop on line 12 of sqroot.f (DO J) An induction
C     variable of a contained loop has a starting value or
C     stride that appears to vary with each iteration%%%
C###12 [fc] Loop on line 12 of sqroot.f (DO J) not vectorized%%%
do j = 1,n
C###13 [fc] Loop on line 13 of sqroot.f (DO I) fully vectorized%%%
do i = j,n ! What's wrong with this loop ?
l(i,j) = 1.0
end do
end do
c
C###18 [fc] Loop on line 18 of sqroot.f (DO I) fully vectorized%%%
do i = 1,n
d(i) = 2.0
end do


« Matrix Square Root 


c
c Method I:
c
time1 = cputime ( 0.0 )
time2 = cputime ( time1 )
overhd = time2 - time1
time1 = cputime ( 0.0 )
c
C###29 [fc] Loop on line 29 of sqroot.f (DO J) not vectorized%%%
C###29 [fc] Loop on line 29 of sqroot.f (DO J) An induction
C     variable of a contained loop has a starting value or
C     stride that appears to vary with each iteration%%%
do j = 1,n
C###30 [fc] Loop on line 30 of sqroot.f (DO I) fully vectorized%%%
do i = j,n
g(i,j) = l(i,j) * sqrt(d(j))
end do
end do
c
time2 = cputime ( time1 )
time = time2 - time1 - overhd
print *, ' Method I time: ', time, ' secs.'


Matrix Square Root 


c
c Method II: (include upper triangular elements in multiplication,
c even though we're multiplying by zero. We get full
c vectorization, though )
c
time1 = cputime ( 0.0 )
time2 = cputime ( time1 )
overhd = time2 - time1
time1 = cputime ( 0.0 )
c
C###48 [fc] Loop on line 48 of sqroot.f (DO J) fully vectorized%%%
do j = 1,n
do i = 1,n
g(i,j) = l(i,j) * sqrt(d(j))
end do
end do
c
time2 = cputime ( time1 )
time = time2 - time1 - overhd
print *, ' Method II time: ', time, ' secs.'


• Matrix Square Root


c
c Method III:
c
time1 = cputime ( 0.0 )
time2 = cputime ( time1 )
overhd = time2 - time1
time1 = cputime ( 0.0 )
c
C###65 [fc] Loop on line 65 of sqroot.f (DO I) fully vectorized%%%
do i = 1,n
d(i) = sqrt( d(i) )
end do
C###68 [fc] Loop on line 68 of sqroot.f (DO J) fully vectorized%%%
do j = 1,n
do i = 1,n
g(i,j) = l(i,j) * d(j)
end do
end do
c
time2 = cputime ( time1 )
time = time2 - time1 - overhd
print *, ' Method III time: ', time, ' secs.'
c
stop
end
Timing:


Method I time: 6.7269996E-02 secs.
Method II time: 7.9784006E-02 secs.
Method III time: 4.8183024E-02 secs.


- Matrix Transpose


Method 1: 
do j = 1,500 
do i = 1,100 
atrans(i,j) = a(j,i) 
end do 
end do 
Method 2: 


call trans ( 500, 100 ) 


subroutine trans ( nrow, ncol ) 
common a(500,100), atrans(100,500) 


c

do i = 1,ncol

do j = 1,nrow 
atrans(i,j) = a(j,i) 
end do 

end do 
c 

return 


end 




- Matrix Transpose


Source:


program transpose
common a(500,100), atrans(100,500)
c
c Illustrate matrix transposition for rectangular matrices in which the
c row dimension exceeds the column dimension.
c
c Method I: Unit stride for output (transposed) matrix
c
time1 = cputime ( 0.0 )
time2 = cputime ( 0.0 )
overhd = time2 - time1
time1 = cputime ( 0.0 )
c
C###14 [fc] Loop on line 14 of transpose.f (DO J) fully vectorized%%%
do j = 1,500
do i = 1,100
atrans(i,j) = a(j,i)
end do
end do
c
time2 = cputime ( 0.0 )
time = time2 - time1 - overhd
print *, ' Method I time: ', time, 'secs.'


• Matrix Transpose


c
c Method II: Get around loop interchange problem.
c
time1 = cputime ( 0.0 )
time2 = cputime ( 0.0 )
overhd = time2 - time1
time1 = cputime ( 0.0 )
call trans ( 500, 100 )
time2 = cputime ( 0.0 )
time = time2 - time1 - overhd
print *, ' Method II time : ', time, 'secs.'
c
stop
end


subroutine trans ( nrow, ncol )
common a(500,100), atrans(100,500)
c
C###40 [fc] Loop on line 40 of transpose.f (DO I) fully vectorized%%%
do i = 1,ncol
do j = 1,nrow
atrans(i,j) = a(j,i)
end do
end do
c
return
end
Timings:


Method I time: 2.8062003E-02secs.
Method II time: 1.8757001E-02secs.


« Polynomial Evaluation 


Method 1:


do j = 1,10000
p = a(n+1)
do i = 1,n
p = x*p + a(n-i+1)
end do
end do


Method 2:


do j = 1,10000
t1 = a(1) + a(2)*x
t2 = a(3)*(x*x) + a(4)*(x*x)*x
p = t1 + t2
end do


« Polynomial Evaluation 
program polynomial
c
c Simple polynomial evaluation.
c
real*4 a(4), x, p, time, time1, time2, cputime, overhd
integer*4 n
data n / 3 /, x / 2.0 /
c
C###9 [fc] Loop on line 9 of poly.f (DO I) fully vectorized%%%
do i = 1,n
a(i) = 1.0
end do
c
time1 = cputime ( 0.0 )
time2 = cputime ( time1 )
overhd = time2 - time1
time1 = cputime ( 0.0 )
c
C###18 [fc] Loop on line 18 of poly.f (DO J) not vectorized%%%
C###18 [fc] Loop on line 18 of poly.f (DO J) unable to
C     distribute loop%%%
do j = 1,10000
p = a(n+1)
C###20 [fc] Loop on line 20 of poly.f (DO I) not vectorized%%%
C###20 [fc] Loop on line 20 of poly.f (DO I) has insufficient
C     vectorizable code%%%
C###20 [fc] Loop on line 20 of poly.f (DO I) The assignment to P on
C     line 21 appears to be in a recurrence%%%
do i = 1,n
p = x*p + a(n-i+1)
end do
end do
c
time2 = cputime ( time1 )
time = time2 - time1 - overhd
print *, ' Method I time : ', time, ' secs.'


• Polynomial Evaluation


c
time1 = cputime ( 0.0 )
time2 = cputime ( time1 )
overhd = time2 - time1
time1 = cputime ( 0.0 )
c
C$DIR SCALAR
C###35 [fc] Loop on line 35 of poly.f (DO J) vectorization inhibited
C     by SCALAR directive%%%
do j = 1,10000
t1 = a(1) + a(2)*x
t2 = a(3)*(x*x) + a(4)*(x*x)*x
p = t1 + t2
end do
c
time2 = cputime ( time1 )
time = time2 - time1 - overhd
print *, ' Method II time: ', time, ' secs.'
c
stop
end
Timing:


Method I time: 0.1014940 secs.
Method II time: 3.2079965E-03 secs.
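
Method 1 is Horner's rule written as a recurrence on `p`; Method 2 writes out the same cubic explicitly. The sketch below checks that the two forms agree for one input; the program name `poly2` and the choice to set all four coefficients to 1.0 are assumptions made for this check, and the comparison is exact only because every intermediate value here is a small integer.

```fortran
program poly2
  implicit none
  integer, parameter :: n = 3
  real :: a(n+1), x, p1, p2, t1, t2
  integer :: i

  a = 1.0      ! all four coefficients set (an assumption for this check)
  x = 2.0

  ! Method 1: Horner's rule, a recurrence on the running value.
  p1 = a(n+1)
  do i = 1, n
     p1 = x*p1 + a(n-i+1)
  end do

  ! Method 2: expanded form for n = 3, with no loop-carried dependence.
  t1 = a(1) + a(2)*x
  t2 = a(3)*(x*x) + a(4)*(x*x)*x
  p2 = t1 + t2

  ! 1 + 2 + 4 + 8 = 15 in both forms.
  if (p1 /= p2) stop 1
  print *, p1, p2
end program poly2
```

The expanded form trades a few extra multiplications for independent terms, which is the property the timing comparison exploits.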


« Boundary Conditions 
Method 1: 


do 20 j = 2, 100 
do 10 i = 1, 100 
if (i.ne. 1) then 

a(i,j) = b(i,j) 
else 
a(i,j) = 0.0 
endif 
10 continue 
20 continue 


Method 2: 


do 40 j = 2, 100 
a(1,j) = 0.0 
do 30 i = 2, 100
a(i,j) = b(i,j)
30 continue 
40 continue 


- Boundary Conditions


program bnds
real*8 a(100,100), b(100,100)
real*4 time, time1, time2, ovrhd, cputime
integer*4 i, j
time1 = cputime(0.0)
time2 = cputime(time1)
ovrhd = time2 - time1
c
time1 = cputime(0.0)
c
C###11 [fc] Loop on line 11 of bnd.f (DO J) fully vectorized%%%
do 20 j = 2, 100
do 10 i = 1, 100
if (i.ne. 1) then
a(i,j) = b(i,j)
else
a(i,j) = 0.0
endif
10 continue
20 continue
c
time2 = cputime(time1)
time = time2 - time1 - ovrhd
print *,' Method I time: ',time,' seconds.'


- Boundary Conditions


c
time1 = cputime(0.0)
time2 = cputime(time1)
ovrhd = time2 - time1
c
time1 = cputime(0.0)
c
C###31 [fc] Loop on line 31 of bnd.f (DO J) (distributed loop #1)
C     fully vectorized%%%
C###31 [fc] Loop on line 31 of bnd.f (DO J) (distributed loop #2)
C     fully vectorized%%%
C###31 [fc] Loop on line 31 of bnd.f (DO J) distributed,
C     forming 2 loops%%%
do 40 j = 2,100
a(1,j) = 0.0
do 30 i = 2, 100
a(i,j) = b(i,j)
30 continue
40 continue
c
time2 = cputime(time1)
time = time2 - time1 - ovrhd
print *,' Method II time: ',time,' seconds.'
stop
end
Timings:


Method I time: 7.8180004E-03 seconds.
Method II time: 2.6069973E-03 seconds.


RESTRUCTURING
CODE


Restructuring is the process by which
existing source code is examined and
modified in order to increase
vectorization and optimization.


- Profiling 


Identify which routine or areas of code account 
for significant portions of the total CPU time used. 


This is accomplished by using the (-p) option for 
compiling and linking the code. 


- Profiling 


Example of a profile


%time  cumsecs     #call  ms/call  name
         43.07                     mcount
         66.48   3634464     0.01  _cvmgm_
         86.47      1632    12.25  _monot_
         99.10       408    30.96  _flaten_
        109.68      1632     6.48  _interp_
        120.13       408    25.61  _states_
        130.38       408    25.12  _riemann_
        139.05    412488     0.02  _mth$r_sqrt
        144.78       408    14.04  _detect_
        146.18                     _monstartup
        147.50    247248     0.01  _cvmgp_
        148.75    244800     0.01  _cvmgz_
        149.75      5714     0.18  _mth$vr_sqrt
        150.25       408     1.23  _hydrow_
        150.68       408     1.05  _tstep_
        151.04       408     0.88  _coeff_
        151.35       408     0.76  _intrfc_
        151.57      1006     0.22  _evt
        151.78               0.15  _for$do_fio
        151.97             190.00  _MAIN__
        152.16               0.01  _X_putc
        152.34               0.18  _wrt_E

- Compiler Messages 


Examine the compiler generated messages. 


cvmgm.f:
cvmgp.f:
cvmgz.f:
detect.f:
Loop on line 38.1 of detect.f (DO J) fully vectorized
Loop on line 45.1 of detect.f (DO I) fully vectorized
Loop on line 49.1 of detect.f (DO J) fully vectorized
Loop on line 54.1 of detect.f (DO I) contains a subroutine or function call
Loop on line 54.1 of detect.f (DO I) not vectorized
Loop on line 63.1 of detect.f (DO I) contains a subroutine or function call
Loop on line 63.1 of detect.f (DO I) not vectorized
Loop on line 67.1 of detect.f (DO I) fully vectorized
Loop on line 72.1 of detect.f (DO I) contains a subroutine or function call
Loop on line 72.1 of detect.f (DO I) not vectorized
Loop on line 78.1 of detect.f (DO I) contains a subroutine or function call
Loop on line 78.1 of detect.f (DO I) not vectorized
Loop on line 86.1 of detect.f (DO I) fully vectorized
Loop on line 91.1 of detect.f (DO I) fully vectorized


